Automated Data Collection with R Blog

The following is a guest post by Jana Blahak and Jan Dix (University of Konstanz), with support from Simon Munzert.

We are happy to introduce our freshly created rzeit package. It connects to the Content API at ZEIT Online, a German newspaper website. In short, the package allows you to

  • conduct an unfiltered search for articles,
  • use a variety of parameters to refine query fields, e.g. to specify content and time, and
  • easily inspect meta as well as article data.

The package is publicly available on GitHub. In this blog post, we demonstrate the basic features of the package. In a follow-up post, we will dig deeper and show how the package can be used to construct networks of popular German politicians.

Setup

Currently, the package is only available on GitHub. Using the devtools package, you can easily install it:

devtools::install_github("tollpatsch/rzeit")
library(rzeit)

In the following example, we also draw on additional, well-known packages:

library(stringr)
library(jsonlite)
library(lubridate)

Basic functions

To be able to work with the API, we have to fetch an API key first. There is no sophisticated authentication process involved here--just go to the developer page and sign up by providing your name and a valid email address.

With zeitSetApiKey, we provide a convenient function that stores the key in the R environment. You only have to do this once; the next time R is launched, the key is automatically available and picked up by the package's functions:

zeitSetApiKey(apiKey = "set_your_api_key_here")
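
How zeitSetApiKey persists the key is not shown in this post. As a rough sketch of the general mechanism in R, a key can be written to an environment file and read back with Sys.getenv. The variable name ZEIT_API_KEY_DEMO and the temporary file are assumptions made for illustration, not the names rzeit actually uses:

```r
# Hypothetical variable name; rzeit's own name may differ.
key_name <- "ZEIT_API_KEY_DEMO"

# Write a NAME=value line to a throwaway environment file
# (a temp file, so the real ~/.Renviron stays untouched).
renviron <- tempfile(fileext = ".Renviron")
writeLines(paste0(key_name, "=set_your_api_key_here"), renviron)

# Load the file into the current session and read the key back.
readRenviron(renviron)
stored_key <- Sys.getenv(key_name)
stored_key
```

R reads ~/.Renviron at startup, so a line of this form placed there is one way a key can survive a restart of R.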

Next, we can start tapping the API. fromZeit is the core function of the package. Again, because the API key is stored in the environment, we do not have to pass the key explicitly (but still could do so using the api argument). As an example, we collect articles that include "Angela Merkel" in the article body, headline or byline:

results <- fromZeit(q = "Angela Merkel",
                    limit = "100",
                    dateBegin = "2015-06-01",
                    dateEnd = "2015-08-01")

Note that for ease of exposition, we limited the number of collected results to 100 here using the limit argument. The maximum limit per call is 1000. Further, we restricted the search to articles published within a period of two months (June 1 to August 1, 2015).

The results object is of class list and provides information about the articles found as well as the number of hits for a given period. To extract information about the latter, we can draw on the zeitFrequencies function, which takes the results object as its main argument and returns a data frame that includes a continuous list of dates at the chosen interval and the related frequencies:

freq <- zeitFrequencies(ls = results,
                        sort = "days",
                        save = FALSE) 
head(freq)
##         date dayCount freq freqPro
## 1 2015-07-06        1    3      30
## 2 2015-07-07        2    4      40
## 3 2015-07-08        3    9      90
## 4 2015-07-09        4    4      40
## 5 2015-07-10        5    1      10
## 6 2015-07-11        6    1      10
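
The frequency table is an ordinary data frame, so the usual base R tools apply. The sketch below rebuilds the six rows printed above by hand (so that it runs without an API key) and picks out the day with the most hits. The object names freq_head, peak and total are ours, not part of the package:

```r
# Rebuild the six rows shown in the head(freq) output above.
freq_head <- data.frame(
  date     = as.Date(c("2015-07-06", "2015-07-07", "2015-07-08",
                       "2015-07-09", "2015-07-10", "2015-07-11")),
  dayCount = 1:6,
  freq     = c(3, 4, 9, 4, 1, 1),
  freqPro  = c(30, 40, 90, 40, 10, 10)
)

# Day with the highest number of hits among these rows.
peak <- freq_head[which.max(freq_head$freq), ]
peak$date

# Total number of hits across these six days.
total <- sum(freq_head$freq)
total
```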

Apart from these meta data, we can also process substantive article information. The function zeitToDf converts the list returned from fromZeit into a data frame:

articles <- zeitToDf(ls = results,
                     sort = "days",
                     save = FALSE)
names(articles)
## [1] "daynum"      "date"        "title"       "subtitle"    "snippet"    
## [6] "teaserText"  "teaserTitle" "link"

Finally, the function zeitPlot offers a first inspection of the collected time series. It plots date versus frequencies based on the frequency data frame returned by zeitFrequencies:

zeitPlot(df = freq) 

Example

So much for the package's basic functionality. In the following, we slightly modify our running example to demonstrate additional features of the fromZeit function.

Perform queries

Again, we are looking for articles on Angela Merkel. However, we now set the fromZeit argument multipleTokens = TRUE, so that the API returns results both for "Angela" and for "Merkel". There are more than 1000 hits in the given time span (covering all of 2013 and 2014), but given the limit of 1000, the function only returns the first thousand articles in descending order:

results_split <- fromZeit(q = "Angela Merkel",
                          limit = "1000",
                          dateBegin = "2013-01-01",
                          dateEnd = "2014-12-31",
                          multipleTokens = TRUE)

# Aggregate the hits into monthly frequencies
frequencies_split <- zeitFrequencies(ls = results_split,
                                     sort = "month",
                                     save = FALSE)
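
To picture the difference between the two search modes, the following sketch splits the query string into separate tokens with base R. It only illustrates the idea of searching for several tokens instead of one string; how fromZeit builds its requests internally is not shown in this post, and the variable names are ours:

```r
query <- "Angela Merkel"

# Treated as multiple tokens: each word becomes its own search term.
tokens <- strsplit(query, " ", fixed = TRUE)[[1]]
tokens

# Treated as a single token: the whole string is one search term.
phrase <- query
length(tokens)
length(phrase)
```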

For the second run, we set multipleTokens = FALSE. The API will return results for the entire string "Angela Merkel". Furthermore, we set limit = "1000" and split = FALSE. The default value for the split argument is TRUE, which allows us to circumvent the technical limit of 1000 articles per query for the whole time span, as the search is now split into monthly searches. It is unlikely that there are more than 1000 articles on Angela Merkel per month, so we expect to capture all relevant articles in the given time period. If we set split = FALSE, however, only the most recent 1000 hits (as defined with the limit argument) are returned:

results_withoutsplit <- fromZeit(q = "Angela Merkel",
                                 split = FALSE,
                                 limit = "1000",
                                 dateBegin = "2013-01-01",
                                 dateEnd = "2014-12-31",
                                 multipleTokens = FALSE)

# Aggregate the hits into monthly frequencies
frequencies_withoutsplit <- zeitFrequencies(ls = results_withoutsplit,
                                            sort = "month",
                                            save = FALSE)
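
The monthly splitting described above can be pictured with a few lines of base R. This is a minimal sketch of the idea, not the package's implementation; the helper monthly_windows is hypothetical and simply cuts a date range into one window per calendar month, each of which could then be queried separately with its own limit:

```r
# Hypothetical helper: cut a date range into calendar-month windows.
monthly_windows <- function(date_begin, date_end) {
  begin <- as.Date(date_begin)
  end   <- as.Date(date_end)
  # First day of every month touched by the range.
  starts <- seq(as.Date(format(begin, "%Y-%m-01")), end, by = "month")
  # Each window ends the day before the next one starts; the last ends at `end`.
  ends <- c(starts[-1] - 1, end)
  # The first window starts at the requested begin date, not the 1st of the month.
  starts[1] <- begin
  data.frame(from = starts, to = ends)
}

windows <- monthly_windows("2013-01-01", "2014-12-31")
nrow(windows)
head(windows, 3)
```

For the two years used in the example, this yields 24 windows, so a per-query limit of 1000 would apply to each month rather than to the whole time span.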

Plot results

Lastly, we plot the results of the examples next to each other:

par(mfrow=c(1, 2))
zeitPlot(frequencies_withoutsplit, title = "without split", absolute = FALSE)
zeitPlot(frequencies_split, title = "with split", absolute = FALSE)

We see that the second call, with split = FALSE, gathered data only between February and December 2014, although we originally specified a longer time span. This is due to the limit of 1000 returned hits per call. This can still be of use, however, if you are primarily interested in a limited number of hits that do not necessarily have to cover the entire time span. With the first call, we retrieved data for the entire time period. Here we see a peak in the number of articles published on ZEIT Online in the second half of 2013--just about when Mrs. Merkel was running for her third chancellorship in the federal election that took place on September 22, 2013.

We hope that this little introduction to the rzeit package has inspired you to start your own analyses. The package is still under very active development on GitHub. As always, we are looking forward to hearing about your experience with the package. Stay tuned for another application of the package on our blog! Next, we will construct network data out of newspaper articles.